Background: The precise anatomic location and extent of most venous thromboembolism (VTE) are not captured by diagnostic codes, which is a major limitation to research. Correctly identifying the specific site of thrombosis, whether proximal or distal deep vein thrombosis (DVT) or superficial vein thrombosis (SVT), is critical. An accurate natural language processing (NLP) tool would make analysis of large datasets from electronic medical records possible and could be a significant improvement over analyses using International Classification of Diseases codes. Using an open-source NLP tool, we evaluated existing algorithms and then created and validated new NLP algorithms to classify lower extremity thrombosis not only as DVT or SVT, but also by more precise anatomic location.

Methods: A random sample of deidentified ultrasound reports was extracted from electronic medical records, manually reviewed, and classified as either positive or negative for a DVT or SVT (any chronicity). Reports with DVT were further classified into proximal (popliteal, femoral, deep femoral, common femoral, iliac veins, or vena cava) and distal (posterior tibial, anterior tibial, peroneal, soleal, or gastrocnemius veins) DVT. SVT was further classified into great saphenous vein, small saphenous vein, or other site. Thrombosis near the junction of deep and superficial veins was designated as involving a specific vein segment only if there was intraluminal thrombosis at that site. Initial sets of 100 ultrasound reports, 50 positive and 50 negative for DVT/SVT, were used for derivation and iterative testing of each algorithm. Text from radiology reports was cleaned by removing extraneous punctuation, removing text outside of the "findings" and "impression" sections, and replacing all matches for "superficial femoral vein" with "femoral vein". Using target phrases, the simple NLP tool classified each report as either present or absent, signifying positive or negative for thrombosis at the site of interest. After accuracy was maximized in the derivation cohorts, each NLP algorithm was tested in the complete, manually reviewed dataset. Sensitivity (Sn) and specificity (Sp) were calculated, and confidence intervals were determined by the binomial exact method.
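
The cleaning and target-phrase steps described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the tool or the algorithms used in the study: the section header list, the function names, and the target phrases are all hypothetical, and the sketch does not handle negated statements or chronicity, which a usable algorithm would have to address. The replacement of "superficial femoral vein" matters because that segment is a deep vein, so leaving the word "superficial" in place could mislead a phrase-based SVT classifier.

```python
import re

# Hypothetical section headers; real reports may use different labels.
HEADER_RE = re.compile(
    r"^\s*(history|indication|technique|comparison|findings|impression)\s*:\s*(.*)$",
    re.IGNORECASE,
)

def keep_sections(report, wanted=("findings", "impression")):
    """Keep only text that falls under the wanted section headers."""
    kept, current = [], None
    for line in report.splitlines():
        match = HEADER_RE.match(line)
        if match:
            current = match.group(1).lower()
            line = match.group(2)  # text following the header on the same line
        if current in wanted:
            kept.append(line)
    return " ".join(kept)

def clean_text(text):
    """Lowercase, strip punctuation, collapse whitespace, normalize vein names."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    text = re.sub(r"\s+", " ", text).strip()
    return text.replace("superficial femoral vein", "femoral vein")

def classify(report, target_phrases):
    """Return 'present' if any target phrase occurs in the cleaned report."""
    text = clean_text(keep_sections(report))
    return "present" if any(p in text for p in target_phrases) else "absent"

# Hypothetical target phrases for a proximal DVT classifier.
PROXIMAL_PHRASES = ["thrombus in the femoral vein", "thrombus in the popliteal vein"]

example = (
    "HISTORY: Leg swelling.\n"
    "FINDINGS: Occlusive thrombus in the superficial femoral vein.\n"
    "IMPRESSION: Acute DVT."
)
print(classify(example, PROXIMAL_PHRASES))
```

In this sketch the history section is discarded before matching, so a phrase such as "rule out DVT" in the clinical indication would not trigger a positive classification.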

Results: A total of 1206 ultrasounds were reviewed, of which 687 were positive for DVT (503 with proximal and 378 with distal DVT involvement). A total of 176 had SVT, 114 with involvement of the great saphenous vein (GSV) and 65 with involvement of the small saphenous vein (SSV). The previously published NLP algorithm (designed to identify DVT) had poor Sn (45.0%) but reasonable Sp (91.0%) for DVT. Among its incorrect positive determinations (n=47), 33 were attributable to the presence of SVT in the absence of DVT. Our newly developed NLP algorithm accurately identified DVT at any site (Sn 96.2%, 95% CI 94.5-97.5; 661/687 and Sp 93.8%, 95% CI 91.4-95.7; 487/519). Among ultrasounds positive for DVT, the NLP algorithm to determine proximal DVT site had a Sn of 96.0% (95% CI 93.9-97.6; 483/503) and a Sp of 97.8% (95% CI 94.5-99.4; 179/183). The algorithm for distal DVT had a Sn of 96.8% (95% CI 94.5-98.4; 366/378) and a Sp of 96.1% (95% CI 93.3-98.0; 296/308). The algorithm to identify SVT was also highly accurate (Sn 97.7%, 95% CI 94.3-99.4; 172/176 and Sp 95.5%, 95% CI 94.1-96.7; 984/1030). Among ultrasounds positive for SVT, the algorithm to determine GSV site had a Sn of 94.7% (95% CI 88.9-98.0; 108/114) and a Sp of 95.1% (95% CI 86.3-99.0; 58/61). The algorithm to identify SSV site had a Sn of 98.5% (95% CI 91.7-99.9; 64/65) and a Sp of 100% (95% CI 96.7-100.0; 110/110).
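
For readers who wish to check intervals of this kind, the binomial exact (Clopper-Pearson) method named in the Methods can be sketched with the Python standard library alone. This is a minimal sketch, not the software used in the study; the function names are hypothetical, and the bounds are found by simple bisection on the binomial tail probabilities rather than by the beta-distribution quantiles a statistics package would typically use.

```python
import math

def log_pmf(i, n, p):
    """Log of the binomial probability of i successes in n trials."""
    return (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p))

def tail(first, last, n, p):
    """P(first <= X <= last) for X ~ Binomial(n, p)."""
    return sum(math.exp(log_pmf(i, n, p)) for i in range(first, last + 1))

def solve_increasing(g, iters=60):
    """Bisection root of a function that increases in p on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def exact_ci(k, n, alpha=0.05):
    """Return (proportion, lower, upper) using the Clopper-Pearson interval."""
    # Lower bound: the p at which P(X >= k) equals alpha / 2.
    lower = 0.0 if k == 0 else solve_increasing(
        lambda p: tail(k, n, n, p) - alpha / 2)
    # Upper bound: the p at which P(X <= k) equals alpha / 2.
    upper = 1.0 if k == n else solve_increasing(
        lambda p: alpha / 2 - tail(0, k, n, p))
    return k / n, lower, upper

# Counts taken from the Results: DVT at any site and SSV site.
for label, k, n in [("DVT Sn", 661, 687), ("DVT Sp", 487, 519),
                    ("SSV Sn", 64, 65), ("SSV Sp", 110, 110)]:
    est, lo, hi = exact_ci(k, n)
    print(f"{label}: {100 * est:.1f}% (95% CI {100 * lo:.1f}-{100 * hi:.1f}; {k}/{n})")
```

When every report is classified correctly (for example 110/110), the lower bound has the closed form (alpha/2)^(1/n), which gives a quick check on the bisection result.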

Conclusion: Using a previously published, open-source, simple NLP program, we improved the sensitivity and specificity for identification of DVT and created an algorithm that accurately identifies SVT. Using a multifaceted analysis approach, we were also able to accurately subclassify the anatomic location of thrombosis in ultrasound reports. This tool and the developed algorithms will allow analysis of large datasets with minimal effort and high accuracy, capitalizing on the power of large electronic datasets to offer new insights into pathophysiology and clinical prognosis.

Disclosures

No relevant conflicts of interest to declare.
